Deep generative model of the distal tibial classic metaphyseal lesion in infants: assessment of synthetic images

Abstract Background The classic metaphyseal lesion (CML) is a distinctive fracture highly specific to infant abuse. To increase the size and diversity of the training CML database for automated deep-learning detection of this fracture, we developed a mask conditional diffusion model (MaC-DM) to generate synthetic images with and without CMLs. Purpose To objectively and subjectively assess the synthetic radiographic images with and without CMLs generated by MaC-DM. Materials and Methods For retrospective testing, we randomly chose 100 real images (50 normals and 50 with CMLs; 39 infants, male = 22, female = 17; mean age = 4.1 months; SD = 3.1 months) from an existing distal tibia dataset (177 normal, 73 with CMLs), and generated 100 synthetic distal tibia images via MaC-DM (50 normals and 50 with CMLs). These test images were shown to 3 blinded radiologists. In the first session, radiologists determined if the images were normal or had CMLs. In the second session, they determined if the images were real or synthetic. We analyzed the radiologists’ interpretations and employed t-distributed stochastic neighbor embedding technique to analyze the data distribution of the test images. Results When presented with the 200 images (100 synthetic, 100 with CMLs), radiologists reliably and accurately diagnosed CMLs (kappa = 0.90, 95% CI = [0.88-0.92]; accuracy = 92%, 95% CI = [89-97]). However, they were inaccurate in differentiating between real and synthetic images (kappa = 0.05, 95% CI = [0.03-0.07]; accuracy = 53%, 95% CI = [49-59]). The t-distributed stochastic neighbor embedding analysis showed substantial differences in the data distribution between normal images and those with CMLs (area under the curve = 0.996, 95% CI = [0.992-1.000], P < .01), but minor differences between real and synthetic images (area under the curve = 0.566, 95% CI = [0.486-0.647], P = .11). Conclusion Radiologists accurately diagnosed images with distal tibial CMLs but were unable to distinguish real from synthetically generated ones, indicating that our generative model could synthesize realistic images. Thus, MaC-DM holds promise as an effective strategy for data augmentation in training machine-learning models for diagnosis of distal tibial CMLs.


Introduction
Child abuse is a major public health problem.Infants (≤1 year old) are particularly vulnerable and constitute 42.4% of children who die from inflicted injuries. 1Most are found to have fractures at the time of death.][8][9] The CML can pose a diagnostic challenge to radiologists because of its variable and frequently subtle radiographic appearances.This injury can occur in any of the long bones, but the distal tibia is one of the most common sites. 7,10,11In a prior study, a machine-learning algorithm was used to detect distal tibial CMLs with a sensitivity and specificity of 88% and 96%, respectively. 12One of the limitations of this prior study was a cohort of only 73 CML images, which were carefully curated from a 12.5-year experience at a large children's hospital.One potentially effective strategy to overcome this limitation in model training is to use a data augmentation methodology to increase the size and diversity of the CML database. 13][16] However, these models can be difficult to train and may generate low-quality and repetitive images when limited data are available. 17Recently, a new class of generative model called denoising diffusion probabilistic model has been shown to be effective in generating realistic-appearing synthetic images and has a more straightforward training process. 18Furthermore, the denoising diffusion probabilistic model has the ability to incorporate conditional information, including class labels, to generate high-quality conditional images. 18We have developed a mask conditional diffusion model (MaC-DM) that incorporates the segmentation masks of the distal tibias and the CML fractures as conditions in generating synthetic images of the distal tibia with and without CMLs. 19he next step in moving toward implementation of this approach is to test the verisimilitude of these synthetic images via two approaches: (1) objectively by using a nonlinear dimensionality reduction technique to compare and analyze the feature distribution of the images; and (2) subjectively by asking the expert opinions of radiologists who are readily able to differentiate those distal tibias with CMLs from those without CMLs.Therefore, the aim of this study was to objectively and subjectively evaluate the verisimilitude of the synthetic distal tibial images with and without CMLs that were generated by MaC-DM.

Methods
Our local institutional review board approved this retrospective study.This study was compliant with the Health Insurance Portability and Accountability Act guidelines, and informed consent was waived.

Generative model
The methodology we used to generate the synthetic distal tibial radiographs for this investigation is detailed in a prior report. 19Briefly, MaC-DM is based on the denoising diffusion probabilistic model 18 and involves explicitly adding random noise to real images and training a deep learning model to progressively remove this noise.We modified this denoising diffusion probabilistic model by incorporating segmentation masks of the CML fracture region and the distal tibia as conditional information.Our MaC-DM uses these segmentation masks along with a classifier guidance technique 18 to generate class-specific synthetic radiographs (ie, with or without CML fractures).The detailed architecture of MaC-DM is shown in Figure S1.

Study material
We used a previously assembled database of distal tibial images with and without CMLs 12 collected over a 12.5-year period (January 2009-May 2021) from a large tertiary children's hospital.The images were tightly cropped and manually aligned anteroposterior radiographs of the distal tibia from infants with suspected abuse.Specifically, the original 1024 × 1024 grayscale tibial radiographs were cropped to the distal fifth of the tibia using MATLAB 8.3 (MathWorks, Natick, MA), resized to 256 × 256 pixels, and saved in JPEG format by an experienced pediatric radiologist (A.T.).There were 2 different sets of images curated in this database: distal tibial radiographs without CMLs (the normal dataset) and with CMLs (the abnormal dataset).The inclusion criteria for the normal dataset were: (1) negative skeletal survey (excluding simple skull fracture); (2) a clinical history of head trauma that initiated the skeletal survey; (3) a concurrent head CT showing no significant intracranial injury (excluding focal extra-axial blood underneath a skull fracture, if any); (4) no concern for abuse by the child protection team; (5) no follow-up skeletal survey; and (6) no other risk factors for abuse.These infants were considered to be at low risk of abuse.The inclusion criteria for the abnormal dataset were: (1) positive skeletal survey (in addition to a distal tibial CML); (2) a final report indicating a distal tibial CML with subsequent confirmation by the senior author; (3) high-risk of abuse as determined by the child protection team; and (4) a follow-up skeletal survey.Selection of the normal and abnormal study cohorts are shown in Figure 1.
The resultant curated database consisted of 250 distal tibial radiographs.The normal subset of this database consisted of 177 radiographs from 89 infants (male ¼ 61, female ¼ 28; mean age ¼ 5.0 months, SD ¼ 3.3 months, range ¼ 0.2-11.6months).The abnormal subset of this database consisted of 73 radiographs from 35 infants (male ¼ 24, female ¼ 11; mean age ¼ 3.3 months, SD ¼ 2.9 months, range ¼ 0.4-12 months).From this curated distal tibia image database, we randomly chose 100 real images (50 normal and 50 with CMLs) from 39 infants (male ¼ 22, female ¼ 17; mean age ¼ 4.1 months; SD ¼ 3.1 months) to include in our testing database.Next, for each real normal image in our curated database (N ¼ 177), we used MaC-DM to generate 1 synthetic normal and 1 synthetic CML image based on the input binary class condition (eg, normal, CML).We then randomly selected 50 normal and 50 CML images from this resulting pool of 354 synthetic images (177 real normal images × 2 synthetic images generated per real normal image).These synthetic images were 256 × 256 pixels in size and saved in JPEG format.In the end, our testing database consisted of 200 images (with equal numbers of real normal, real CML, synthetic normal, and synthetic CML images).

Latent feature distribution of the testing database
To visualize our complex high-dimensional imaging data, we used the t-distributed stochastic neighbor embedding (t-SNE). 20This nonlinear dimensionality reduction technique embeds high-dimensional data into a low-dimensional space (2 or 3 dimensions) for optimal data visualization, direct comparison, and statistical analysis of both real and synthetic images.For this technique, we first used the aforementioned real CML dataset (N ¼ 250, with 73 CML images and 177 normal images, all cropped and resized to 256 × 256 during the preprocessing step) to train a ResNet-34 model with 5-fold cross validation-using 80% of the data for training and 20% for testing in each fold. 19Then, the trained ResNet-34 model that achieved the best cross-validation performance was selected for feature extraction.We used the secondto-last layer (average pooling layer) of the trained ResNet-34 model to derive the 2-dimensional latent feature representations of the curated real CML dataset.Using this derived latent feature space, we applied the t-SNE method to our testing database (N ¼ 200) and projected the latent features of this database onto a 2-dimensional plane for visualization, comparison, and analysis.

Image interpretation
Three board-certified radiologists with 30 years (radiologist #1 [R1]), 27 years (radiologist #2 [R2]), and 14 years (radiologist #3 [R3]) of postfellowship experience in pediatric musculoskeletal imaging were blinded to patient demographics, diagnosis, study design, and image acquisition methods.In the first reading session, the radiologists independently determined if the randomly ordered distal tibial images within the testing database were normal or had CMLs.In this first session, they were not informed that the testing image database included synthetic images.In the second session conducted more than a month later, they were informed that the testing image database included synthetic images, and they were asked which of these same, rerandomized images were real and which were synthetic images.The radiologists viewed these images using Windows 10 Microsoft Photos.

Evaluation of latent feature space
Receiver operating characteristic (ROC) analysis was used to objectively compare the measured feature distributions in the latent space for (1) normal versus CML images and (2) real versus synthetic images.For each comparison, the area under the curve (AUC) statistic was obtained from the predicted probabilities of a logistic regression model fit by penalized maximum likelihood using SAS's LOGISTIC procedure (21).The predictors in this model were obtained from the twodimensional t-SNE features described previously.In this context, AUC measured the probability that the predicted values of one distribution was greater than another (AUC ¼ Prob(pre-d1 > pred2)).An AUC of 0.5 indicates two perfectly overlapping distributions; an AUC of 1.0 indicates two perfectly distinct distributions, and thus an AUC near 1.0 indicates "difference." The AUC analysis was supplemented by the parametric multivariate t-test (SAS's GLM procedure 21 ) and the permutational multivariate analysis of variance (PERMANOVA 22 ) nonparametric test to detect differences between the bivariate distributions.We conducted pre-and posttest power calculations to ensure that the 200 images in our testing database was sufficient to detect these differences with reasonable statistical power (details in Supplementary Material S2).

Evaluation of radiologists' interpretations
For the interpretation of the radiologists, we assessed interreader agreement via the kappa statistic (scale -1 to 1) using the following guideline 23 : <0.0 (poor agreement), 0.0-0.2(slight agreement), 0.2-0.4(fair agreement), 0.4-0.6 (moderate agreement), 0.6-0.8(substantial agreement), and 0.8-1.0(almost perfect agreement).To further assess inter-rater reliability, we calculated Krippendorff's Alpha (scale -1 to 1). 24,25or this reliability metric, we considered Alpha >0.8 as having good agreement. 24o assess the performances of each of the three radiologists, we constructed 2 × 2 contingency tables for each to assess their ability to distinguish (1) normal versus CML images and (2) real versus synthetic images.To combine results over the 3 radiologists, we estimated sensitivity and specificity using separate population average (marginal) models to accommodate the likely positive within-reader correlation. 26These models used an exchangeable variance matrix and were fitted using SAS's GLIMMIX procedure 21 .Accuracy was derived from these independent sensitivity and specificity estimates (details in Supplementary Material S4).In addition, we used hierarchical summary receiver operating characteristic methods 27 and utilized the meta-analysis tool 28 to compute the ensemble estimates of sensitivity and specificity.Hierarchical summary ROC accounts for within-reader correlation when combining results from the three different radiologists.
To assess within-image agreement, we analyzed how consistently and accurately the radiologists were in classifying each image.Specifically, for the two classification tasks (normal versus CML, and real versus synthetic), we collected the three radiologists' ensemble interpretation of each image, and grouped them into four categories: 0, no radiologist agreed with the ground truth; 1, one radiologist agreed with the ground truth; 2, two radiologists agreed with the ground truth; and 3, three radiologists agreed with the ground truth.We used this methodology as a surrogate to identify the perceived level of difficulty in classifying these images.

Results
Examples of real and synthetic distal tibial images with and without CMLs are shown in Figure 2. The three radiologists' corresponding assessments are shown in Supplementary Material S3.

Evaluation of latent feature space
Using t-SNE, we visually compared the data distributions of normal distal tibial images and those with CMLs, as well as the data distributions of those images that are real and synthetic (Figure 3).For the former, we observed two distinct clusters in the two-dimensional latent feature space, one corresponding to images with CMLs and one corresponding to normal images (Figure 3A).This distinct spatial clustering implied that the data distributions of these two image types were very different and would enable easy partitioning of these two clusters.ROC analysis (Figure 4B) indicated a substantial difference between the two groups (AUC ¼ 0.996, 95% CI ¼ [0.992-1.000]).These results were supported by multivariate tests for group difference (P < .001for both parametric t-test and nonparametric PERMANOVA).
In contrast, there was substantial overlap in the latent feature distributions of real and synthetic images, indicating similarity in their imaging appearances (Figure 3B).And as such, the two image types would be difficult to discriminate.ROC analysis (Figure 4B) showed that the difference between the two distributions was small, close to 0.5 (AUC ¼ 0.566, 95% CI ¼ [0.486-0.647]).Consistent with this finding, multivariate statistical tests for group difference were not statistically significant (parametric t-test [P ¼ .21],nonparametric PERMANOVA [P ¼ .11]).

Evaluation of radiologists' interpretations
Overall, reader agreement in differentiating normal images from those with CMLs were good/excellent (kappa ¼ 0.90, The ROC analysis (Figure 5A) shows the accuracy (AUC) based on correlated models that combined all readers.Radiologists accurately differentiated normal images from those with CMLs (accuracy ¼ 92%, 95% CI ¼ [89-97]).For the subset of real images, they achieved an accuracy of 90% with 95%   The ROC analysis (Figure 5B) shows the accuracy (AUC) based on correlated models that combined all readers.Radiologists were inaccurate in differentiating between real and synthetic images (accuracy ¼ 53%, 95% CI ¼ [49-59]. For the subset of normal images, they achieved an accuracy of 57% with 95% CI ¼ [42-55].For the subset of images with CMLs, they achieved an accuracy of 49% with 95% CI ¼ [49-59].These and additional results are shown in Tables 2-4.When the model did not converge because of extreme data (eg, when all readers achieved 100% sensitivity), the dataset was augmented with a single observation (eg, ground truth label ¼ 1, reader predicted label ¼ 0) for each of the three readers.

Within-image agreement analysis
Our analysis revealed a significant difference in accuracy and agreement between the two different classification tasks (Figure 6).In particular, for the radiologists' ensemble classification of normal versus CML images, >80% of the images were in category 3, indicating that these images were easily classified into normal versus CML by all three radiologists with few disagreements.In contrast, for the radiologists' ensemble classification of real versus synthetic images, approximately 25% of images were in category 3.In fact, the relatively even distribution across all four categories for this specific task indicated substantial disagreement among radiologists and to the ground truth.

Discussion
This is, to our knowledge, the first study to evaluate synthetic images created by a generative model in a deep learningbased infant abuse application.Through objective and subjective evaluations of the synthetic images, we found that they were of diagnostic quality and indistinguishable from real images.Specifically, we found that radiologists accurately diagnosed distal tibial images with and without CMLs.This highlights the ability of our diffusion models to generate synthetic images with class-specific features (ie, with or without CML fractures).In contrast, we found that all three radiologists could not reliably determine whether the distal tibial images were real or synthetic, achieving an accuracy ranging from 45% to 58%.This poor performance supports the visual realism achieved by our synthetic images.This overall performance of the radiologists is concordant with the results of the t-SNE analysis that showed that the data distributions of images with and without CMLs were very different, whereas the data distributions of real and synthetic images were quite similar.
Few prior research studies used radiologists to evaluate augmented images produced by generative models, including models based on GANs 29 and diffusion models. 30Prezja et al. proposed GAN-based models to generate synthetic X-ray images of knee joints to aid in classifying the severity of osteoarthritis. 29They found that radiologists could not accurately distinguish between real and synthetic images.More recently, Ali et al. used stable diffusion models trained with 3156 X-ray images to generate synthetic lung X-ray images. 30They noted that radiologists could accurately distinguish real from synthetic images but that recognizable artifacts on their synthetic images might have aided the radiologist.In contrast, our MaC-DM generated high-quality synthetic images with few and subtle artifacts that are indistinguishable from real images.
An important factor contributing to our success was the integration of segmentation masks within our conditional diffusion model, incorporating regional morphology of the CML fracture and the distal tibia.This strategic integration mitigates the necessity of an extensive training dataset that is generally assumed to be necessary for diffusion models to generate realistic images.
Our approach, which leverages MaC-DM's capabilities to generate synthetic medical images, has important implications for the field of health care.First, the use of synthetically generated medical images may reduce the reliance on large volumes of real patient data for research and development.For instance, recent studies have found that high-quality synthetic data could be used to improve surgical outcomes in robotassisted surgery, 31 to optimize patient-specific treatment plans, 32 and to improve health care models for disease detection. 33For our infant abuse application, synthetic data can be used to increase the size and diversity of our CML database to better train our deep learning-based models for CML detection.Second, realistic synthetic medical images provide a valuable resource for training medical professionals.In particular, by being able to synthetically generate CML fractures with variable patterns (eg,  E, and F).The following four categories were used: 0, no radiologist agreed with the ground truth; 1, one radiologist agreed with the ground truth; 2, two radiologists agreed with the ground truth; 3, three radiologists agreed with the ground truth.
focal vs diffuse or acute versus healing), these synthetic images offer a versatile platform for teaching health care professionals, in detecting CMLs, and in gaining valuable experience with a wide range of CML appearances.Third, synthetically generated medical images can mitigate patient privacy concerns, a particularly important factor in child abuse cases where informed approaches to medico-legal issues and patient confidentiality are vital.Health care providers and researchers can share data for collaborative, teaching, and learning purposes without compromising patient privacy. 34ur results should be considered in light of the following limitations.First, we lack a dataset of real images gathered from other medical facilities for external validation.Second, the images used to train MaC-DM require a preprocessing step that included manual orientation and cropping of skeletal survey radiographs by an experienced radiologist-a subjective and time-consuming task.In the future, we intend to use an object detection method, such as YOLO, 35 to improve the efficiently in collecting real images.Third, although our multivariate tests showed no statistically significant difference between real and synthetic images in the t-SNE space (parametric t-test [P ¼ .21],nonparametric PERMANOVA [P ¼ .11]),some quantifiable differences may be present.These might arise from the subtle variations in distal tibia structures or background textures.The multivariate tests suggest no significant gap, but minor inherent differences between real and synthetic images likely exist.Fourth, our study used binary response data (real/synthetic) for radiologists' opinions.A Likert grading scheme (eg, 1-3 synthetic, 4-6 unsure, 7-9 real) could capture larger degrees of uncertainty in the radiologists' opinions.Fifth, our study focused on the distal tibia and further research is required to determine if our approach is applicable to other anatomic regions where CMLs can occur.Finally, our study included only 200 test images with three experienced pediatric radiologists.A future study using a larger number of test images from multiple institutions and readers with more varied training and experiences would be needed to demonstrate the generalizability to our findings.

Conclusion
Based on objective and subjective evaluations, we found that the MaC-DM generated synthetic distal tibial images, with and without CMLs, that were realistic in appearance and of diagnostic quality.These results suggest that our MaC-DM is a promising strategy for data augmentation in training deep learning models for detection of distal tibial CMLs.

Figure 1 .
Figure 1.Flow chart illustrating the identification of the normal and the abnormal radiographs from which the testing cohort of 200 images (100 abnormal, 100 synthetic) were derived.

Figure 2 .
Figure 2. Examples of synthetic (A, B, C, D) and real (E, F, G, H) distal tibial radiographic images, with classic metaphyseal lesions (CMLs) (C, D, G, H) and without CMLs (A, B, E, F).The interpretations of these eight distal tibial images by the three radiologists are presented in Supplemental Material S3.

Figure 3 .
Figure 3. T-distributed stochastic neighbor embedding (t-SNE) of the testing database (N ¼ 200).(A) Two data clusters are shown in the two-dimensional latent feature space, with one cluster corresponding to normal distal tibial images and the other cluster corresponding to the distal tibial images with CMLs.The spatial separation of these two clusters in the latent space (with minimal overlap) suggested discernible differences in the imaging appearances of these two types of images.(B) In contrast, the substantial overlap in the data distributions of the synthetic and real images within the two-dimensional latent feature space indicated mark similarities in the imaging appearances of real and synthetic images.

Figure 4 .
Figure 4. ROC curves for comparing two data distributions in the latent feature space (as derived from t-SNE): (A) classic metaphyseal lesion (CML) versus normal; and (B) real versus synthetic.Abbreviations: ROC ¼ receiver operating characteristic, t-SNE ¼ t-distributed stochastic neighbor embedding.

Figure 5 .
Figure 5. ROC curves based on correlated models that combined data from all three radiologists in the two classification tasks: (A) classic metaphyseal lesion (CML) versus normal; and (B) real versus synthetic.Abbreviation: ROC ¼ receiver operating characteristic.

Figure 6 .
Figure 6.Histogram of the ensemble interpretations by the three radiologists in the two classification tasks: classic metaphyseal lesion (CML) versus normal (top row: A, B, and C), and real versus synthetic (bottom row: D,E, and F).The following four categories were used: 0, no radiologist agreed with the ground truth; 1, one radiologist agreed with the ground truth; 2, two radiologists agreed with the ground truth; 3, three radiologists agreed with the ground truth.

Table 1 .
Performance of the three radiologists in determining whether or not the images have classic metaphyseal lesions (CMLs).

Table 4 .
Combined estimates of sensitivity and specificity of three radiologists in the two reading sessions using the hierarchical summary receiver operating characteristic method.

Table 2 .
Performance of the three radiologists in determining whether the distal tibial radiographic images were real or synthetic.

Table 3 .
Combined estimates of accuracy, sensitivity, and specificity of three radiologists in the two reading sessions using separate population average (marginal) models.